-
Notifications
You must be signed in to change notification settings - Fork 834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert some panics that happen on invalid parquet files to error results #6738
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this, this seems to add a number of untested additional checks, some to very hot codepaths.
I suggest rather than just looking for things that might panic, instead going from a failing test to a fix. This would also better capture the more problematic cases where the reader gets stuck on malformed input, a panic is a good outcome IMO...
parquet/src/file/metadata/mod.rs
Outdated
@@ -959,17 +959,18 @@ impl ColumnChunkMetaData { | |||
} | |||
|
|||
/// Returns the offset and length in bytes of the column chunk within the file | |||
pub fn byte_range(&self) -> (u64, u64) { | |||
pub fn byte_range(&self) -> Result<(u64, u64)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a breaking API change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah..is it fine given that it just wraps the return value within Result? The behavior change is just "panic --> error".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this is a breaking change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed this change. btw i saw that the planned 54.0.0 will have "major api change", how is an api change considered as acceptable breaking change??
parquet/src/schema/types.rs
Outdated
@@ -1227,6 +1231,10 @@ fn from_thrift_helper(elements: &[SchemaElement], index: usize) -> Result<(usize | |||
if !is_root_node { | |||
builder = builder.with_repetition(rep); | |||
} | |||
} else if !is_root_node { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is based on the comment at line 1230 which says All other types must have one
, and the assert at line 1066: assert!(tp.get_basic_info.()has_repetitio())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern with this check is unless it is necessary for correctness, there is potential adding it breaks something for someone. Parquet is a very broad ecosystem, lots of writers have interesting interpretations of the specification
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it breaks something for someone
if it happened then someone must have already experienced panics. But i get your point, removed this change. thx
parquet/src/thrift.rs
Outdated
@@ -67,7 +67,17 @@ impl<'a> TCompactSliceInputProtocol<'a> { | |||
let mut shift = 0; | |||
loop { | |||
let byte = self.read_byte()?; | |||
in_progress |= ((byte & 0x7F) as u64) << shift; | |||
let val = (byte & 0x7F) as u64; | |||
let val = val.checked_shl(shift).map_or_else( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very performance critical code path, this probably should use wrapping_shl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm afraid wrapping might cause correctness issues...besides, would the checked_shl really make a noticeable performance difference here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't cause correctness issues, and yes it will matter. There are benchmarks that will likely show this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, changed to wrapping_shl, thx!
parquet/src/thrift.rs
Outdated
impl TInputProtocol for TCompactSliceInputProtocol<'_> { | ||
fn read_message_begin(&mut self) -> thrift::Result<TMessageIdentifier> { | ||
unimplemented!() | ||
thrift_unimplemented!() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be unreachable, a panic is the correct thing to do here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iiuc it "should be unreachable" unless the input file is malformed? I guess this goes back to the discussion on how to handle invalid inputs:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it is actually genuinely unreachable, we don't use thrift messages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed back to unimplemented
Thanks @tustvold for the review!
I think the changes are more about converting panics to errors, rather than actual code logic.
these panics were triggered in my own fuzzing test with invalid parquet files. Nevertheless, i think it's a similar topic of "how to handle invalid inputs" as discussed in #5323. Reading this doc, imho errors better than panics unless it's really something unrecoverable. |
Hi @tustvold, I removed some changes based on your comment. PTAL, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Just a few nits. However, I wonder if perhaps this should be multiple PRs with rationales given for each change.
let column_orders = | ||
Self::parse_column_orders(t_file_metadata.column_orders, &schema_descr)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize this would currently panic, but would one ever prefer to just set column_orders
to None
and continue? The only impact AFAIK would be statistics being unusable, which would only matter if predicates were in use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point! i agree with the setting to None
idea. but i guess this worths a separate issue to discuss and fix.
parquet/src/file/statistics.rs
Outdated
if let Some(min) = min { | ||
if min.len() < len { | ||
return Err(ParquetError::General( | ||
"Insufficient bytes to parse max statistic".to_string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Insufficient bytes to parse max statistic".to_string(), | |
"Insufficient bytes to parse min statistic".to_string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for catching this.
Thanks for the review! imho these changes share the same rationale in that they just convert panics to errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks correct to me. I personally would prefer errors to panics when processing multiple files.
Hi @tustvold could you please take another look at this one? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @jp0317 and @etseidl
This looks reasonable to me -- thank you 🙏
I also made a PR to test this out with the DataFusion tests as well: apache/datafusion#13820
Just as a way to double check this doesn't cause any unintended issues
I ran the parquet benchmarks to verify there is no performance implications with this PR My initial results seem to show no noticable pattern / difference in this PR than main. I will run it again to be sure. Details
|
My second performance run likewise shows now particular peformance changes (there is a lot of noise with some benchmarks being faster and some being slower) Details
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jp0317 and @etseidl
I reviewed this PR carefully, and I believe that it:
- It is better than the current code (Detects more errors / avoids panics on bad input)
- Unlikely to impact performance (changes are either added to the metadata parsing code, or called once per page)
However, I would like @tustvold to give it a final look (or say he won't have time to do so) before we merge it
@@ -67,7 +67,7 @@ impl<'a> TCompactSliceInputProtocol<'a> { | |||
let mut shift = 0; | |||
loop { | |||
let byte = self.read_byte()?; | |||
in_progress |= ((byte & 0x7F) as u64) << shift; | |||
in_progress |= ((byte & 0x7F) as u64).wrapping_shl(shift); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of this change?
As I understand it <<
panics in debug builds on overflow but wraps in release builds
It seems to me that this change now avoids panic'ing in debug builds too, which isn't obviously better to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for bringing this up! I'm not sure if <<
is guaranteed to always have the same behavior as wrapping_shl
in release builds. On the other hand, imho this is not really a logic bug, and the only benefit keeping <<
is that debug mode can be used to identify invalid inputs (via panics), but i'm not sure if anyone will rely on that in practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note the C++ code for this loop actually validates [total bytes decoded] (https://github.com/apache/thrift/blob/master/lib/cpp/src/thrift/protocol/TCompactProtocol.tcc#L759) which is probably a good idea (I think this prevents the panic that the original checked shift was meant to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the pointer. iiuc the c++ one may still have an overflow on the 10th bytes where it shifts 63 bits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be misunderstanding but I thoght shifting by 63 bits is well defined on u64? So overflow yes but I wouldn't expect a panic? Also I think silently passing here if there is an overflow would be data corrupt data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can consider this change independently if we are concerned about perf impacts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overflow yes, i agree we can add check on # of bytes decoded separately. For this cl let's just solve the panic ?
parquet/src/schema/types.rs
Outdated
@@ -556,7 +556,8 @@ impl<'a> PrimitiveTypeBuilder<'a> { | |||
} | |||
} | |||
PhysicalType::FIXED_LEN_BYTE_ARRAY => { | |||
let max_precision = (2f64.powi(8 * self.length - 1) - 1f64).log10().floor() as i32; | |||
let length = self.length.checked_mul(8).unwrap_or(i32::MAX); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the overflow error instead of falling through to max precision?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it would be better to return an error:
I double checked that using i32::MAX results in
Max precision: 2147483647
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sg, changed to returning error, thanks!
Thank you for all the work here @jp0317 - I think we have bikeshed this PR enough and plan to merge it in tomorrow unless there are any objections FYI @tustvold I also dismissed your "request changes" review as from what I can see the changes you requested have been made. Please let me know if you disagree |
Which issue does this PR close?
This solves some of #6737.
Rationale for this change
Some code changes to replace some panics with proper errors
What changes are included in this PR?
Some codes that lead to panic are converted to returning error results.
Are there any user-facing changes?
Behavior change from panics to errors.